
Cookie Cats A/B testing

In this notebook we're going to look at data from Cookie Cats, a hugely successful smartphone puzzle game developed by Tactile Entertainment.

As players advance through the levels, they periodically encounter gates that force them either to wait a non-trivial amount of time or to make a significant in-app purchase. Besides driving in-app purchases, these gates serve the important function of giving players an enforced break from the game, which ideally keeps play enjoyable for longer.

The first gate was initially placed at level 30. In this notebook we examine an A/B test in which the first gate in Cookie Cats was moved from level 30 to level 40. In particular, we will look at the effect on player retention.

This notebook is intended to help the organization decide, based on data from this experiment, whether to place the gate at level 30 or level 40.

We could reach a conclusion simply by averaging player retention for each gate placement and picking the level where the gate performed best, but we also need to check whether the difference is statistically significant.

In this notebook we will cover:

-Exploratory data analysis

-Data Manipulation

-Bayesian analysis

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
In [2]:
df=pd.read_csv('./dataset/cookie_cats.csv')

Exploratory data analysis


As the graph below shows, the distribution of "sum_gamerounds", the number of rounds played by each player, contains a large number of outliers.

In [3]:
px.histogram(df,x='sum_gamerounds',log_y=False,marginal="box",color='version')

After trimming the top 5 percent of extreme values, we can see clearly that nearly 56,000 players played fewer than 30 rounds, meaning they never even reached the first gate at level 30.

We will remove these players before the analysis, since our aim is to see whether players were retained after seeing the gate at the corresponding level, and to conclude whether the gate is best placed at level 30 or level 40.

In [4]:
px.histogram(df,x='sum_gamerounds',log_y=False,marginal="box" ,range_x=list(df['sum_gamerounds'].quantile([0,0.950])),
             color='version',barmode='overlay',opacity=0.75)

Data Manipulation


The table below shows the number of players retained after 1 day and after 7 days for each version.

With the gate at level 30:

2,256 players were retained neither after 1 day nor after 7 days.
1,184 players were not retained after 1 day but did come back after 7 days.
7,342 players were retained after 1 day but not after 7 days.
6,176 players were retained both after 1 day and after 7 days.

With the gate at level 40:

1,443 players were retained neither after 1 day nor after 7 days.
980 players were not retained after 1 day but did come back after 7 days.
5,870 players were retained after 1 day but not after 7 days.
5,776 players were retained both after 1 day and after 7 days.

In [5]:
df["Version_N"]=pd.to_numeric(df["version"].apply(lambda x:x.split('_')[1]))
# keep only players who reached the gate in their version
df30=df[(df['Version_N']==30) & (df['sum_gamerounds']>=29)]
df40=df[(df['Version_N']==40) & (df['sum_gamerounds']>=39)]
df=pd.concat([df30,df40]).reset_index(drop=True)
df.groupby(["version","retention_1",'retention_7']).count()['userid'].unstack()
Out[5]:
retention_7            False   True
version   retention_1
gate_30   False         2256   1184
          True          7342   6176
gate_40   False         1443    980
          True          5870   5776

Here is a more visually appealing version of the table above.

In [6]:
px.parallel_categories(df,dimensions=["retention_1",'retention_7'],color='Version_N',
                      color_continuous_scale=["Blue","Red"])

Bayesian analysis


Analysis of player retention on day 1

If we split the data by gate placement and compute the percentage of retained players (or equivalently the mean, since all values are 0s and 1s), we see that the gate at level 40 has a higher retention rate (+3.06%).


An increase of 3.06% might seem small, but for a company at this scale it represents substantial revenue. However, is this number enough to say that the gate at level 40 retains players better than the gate at level 30?
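As a quick cross-check, a classical two-proportion z-test can answer the same question analytically. This is a sketch added for comparison, not part of the original analysis; the counts are copied from the contingency table above rather than read from df:

```python
from statistics import NormalDist
import math

# Day-1 retention counts, copied from the contingency table above
n30, k30 = 16958, 13518   # gate_30: players who reached the gate, retained after 1 day
n40, k40 = 14069, 11646   # gate_40: players who reached the gate, retained after 1 day

p30, p40 = k30 / n30, k40 / n40
p_pool = (k30 + k40) / (n30 + n40)           # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n30 + 1 / n40))
z = (p40 - p30) / se
p_value = 1 - NormalDist().cdf(z)            # one-sided p-value
```

A z-score of roughly 6.9 corresponds to a vanishingly small one-sided p-value, which already hints at the outcome of the permutation test.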

In [7]:
gate_30=df[df['version']=="gate_30"]['retention_1'].astype(int).reset_index(drop=True)
gate_40=df[df['version']=="gate_40"]['retention_1'].astype(int).reset_index(drop=True)
In [8]:
print("Level 30 gate retention rate is:",round(sum(gate_30)/len(gate_30)*100,2),"%")
Level 30 gate retention rate is: 79.71 %
In [9]:
print("Level 40 gate retention rate is:",round(sum(gate_40)/len(gate_40)*100,2),"%")
Level 40 gate retention rate is: 82.78 %
In [10]:
print("The difference is:",round((gate_40.mean()-gate_30.mean())*100,2),"%")
The difference is: 3.06 %

Calculation

H1: The gate at level 40 retains more players than the gate at level 30.
H0: There is no significant difference between the two; the higher retention at level 40 in this sample is due to randomness.

Here we aim to calculate the p-value. A p-value below 1% means the observed difference would be very unlikely under the null hypothesis, supporting the conclusion that the gate at level 40 performs better than at level 30.

I use the permutation method: under the null hypothesis that there is no difference between the gate at level 30 and at level 40, we merge, shuffle, and re-split the data, then calculate the p-value, which tells us the probability of observing a difference at least as large as the measured one if the null hypothesis were true.

As we see, the p-value comes out as 0%: none of the 5,000 permutations produced a difference as large as the observed one (so p < 1/5,000), which means we reject the idea that there is no difference between the gate at level 30 and at level 40.

From the graph below we can see that, after merging and shuffling the two groups under the null hypothesis, not a single permutation came close to the observed difference in the original data (black line, 3.06%).

This leads us to conclude that the gate should be moved from level 30 to level 40.

In [11]:
ss=5000
perm=np.empty(ss)
boot=np.empty(ss)
concat=np.concatenate((gate_40,gate_30))  # pool both groups under the null hypothesis
for i in range(ss):
    # permutation: shuffle the pooled data and re-split into two groups
    perm_m=np.random.permutation(concat)
    perm[i]=np.mean(perm_m[:len(gate_40)])-np.mean(perm_m[len(gate_40):])
    # bootstrap: resample each group with replacement
    boot[i]=np.mean(np.random.choice(gate_40,size=len(gate_40)))-np.mean(np.random.choice(gate_30,size=len(gate_30)))
# one-sided p-value: share of permuted differences at least as large as observed
pvalue=np.sum(perm>=gate_40.mean()-gate_30.mean())/len(perm)
print('P_value=',pvalue*100,"%")
P_value= 0.0 %
In [12]:
retantion_1_seg=pd.concat([pd.DataFrame(perm,columns=['permutation_avg']),pd.DataFrame(boot,columns=['Bootstrap_avg'])],axis=1)
In [13]:
retantion_1_seg=retantion_1_seg.melt(value_vars=['permutation_avg',"Bootstrap_avg"],var_name='type',value_name='avg')
In [14]:
fig=px.histogram(retantion_1_seg,x="avg",color='type',marginal='box',barmode='overlay',opacity=0.75)
fig.update_layout(shapes=[
    dict(
      type= 'line',
      yref= 'paper', y0= 0, y1= 1,
      xref= 'x', x0= gate_40.mean()-gate_30.mean(), x1= gate_40.mean()-gate_30.mean()
    )
])
fig.show()
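The section heading mentions Bayesian analysis, while the cells above actually run a permutation test. For completeness, here is a minimal Bayesian sketch under an assumed uniform Beta(1, 1) prior, again using the counts copied from the retention table above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Day-1 retention counts, copied from the contingency table above
n30, k30 = 16958, 13518   # gate_30: players, retained after 1 day
n40, k40 = 14069, 11646   # gate_40: players, retained after 1 day

# Beta(1, 1) prior + binomial likelihood -> Beta(1 + successes, 1 + failures) posterior
post30 = rng.beta(1 + k30, 1 + n30 - k30, size=100_000)
post40 = rng.beta(1 + k40, 1 + n40 - k40, size=100_000)

# Posterior probability that the level-40 gate retains players better
prob_40_better = np.mean(post40 > post30)
```

The posterior probability that the level-40 gate retains better comes out at essentially 1, agreeing with the permutation result.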

Analysis of player retention on day 7

Same drill here: the gate at level 40 retains more players. We run the same permutation calculation and find that the result is indeed significant, so the level-40 gate performs better for 7-day retention as well.

In [15]:
gate_30=df[df['version']=="gate_30"]['retention_7'].astype(int).reset_index(drop=True)
gate_40=df[df['version']=="gate_40"]['retention_7'].astype(int).reset_index(drop=True)
In [16]:
print("Level 30 gate retention rate is:",round(sum(gate_30)/len(gate_30)*100,2),"%")
Level 30 gate retention rate is: 43.4 %
In [17]:
print("Level 40 gate retention rate is:",round(sum(gate_40)/len(gate_40)*100,2),"%")
Level 40 gate retention rate is: 48.02 %
In [18]:
print("The difference is:",round((gate_40.mean()-gate_30.mean())*100,2),"%")
The difference is: 4.62 %
In [19]:
ss=5000
perm=np.empty(ss)
boot=np.empty(ss)
concat=np.concatenate((gate_40,gate_30))  # pool both groups under the null hypothesis
for i in range(ss):
    # permutation: shuffle the pooled data and re-split into two groups
    perm_m=np.random.permutation(concat)
    perm[i]=np.mean(perm_m[:len(gate_40)])-np.mean(perm_m[len(gate_40):])
    # bootstrap: resample each group with replacement
    boot[i]=np.mean(np.random.choice(gate_40,size=len(gate_40)))-np.mean(np.random.choice(gate_30,size=len(gate_30)))
# one-sided p-value: share of permuted differences at least as large as observed
pvalue=np.sum(perm>=gate_40.mean()-gate_30.mean())/len(perm)
print('P_value=',pvalue*100,"%")
P_value= 0.0 %
In [20]:
retantion_1_seg=pd.concat([pd.DataFrame(perm,columns=['permutation_avg']),pd.DataFrame(boot,columns=['Bootstrap_avg'])],axis=1)
retantion_1_seg=retantion_1_seg.melt(value_vars=['permutation_avg',"Bootstrap_avg"],var_name='type',value_name='avg')
In [21]:
fig=px.histogram(retantion_1_seg,x="avg",color='type',marginal='box',barmode='overlay',opacity=0.75)
fig.update_layout(shapes=[
    dict(
      type= 'line',
      yref= 'paper', y0= 0, y1= 1,
      xref= 'x', x0= gate_40.mean()-gate_30.mean(), x1= gate_40.mean()-gate_30.mean()
    )
])
fig.show()
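The permutation test says the observed difference is unlikely under the null hypothesis, but not how large the true difference plausibly is. As a supplementary sketch, a normal-approximation 95% confidence interval for the day-7 retention difference can be computed from the counts in the table above:

```python
import math

# Day-7 retention counts, copied from the contingency table above
n30, k30 = 16958, 7360    # gate_30: players, retained after 7 days
n40, k40 = 14069, 6756    # gate_40: players, retained after 7 days

p30, p40 = k30 / n30, k40 / n40
diff = p40 - p30
# standard error of the difference between two independent proportions
se = math.sqrt(p30 * (1 - p30) / n30 + p40 * (1 - p40) / n40)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
```

The interval lies entirely above zero, consistent with rejecting the null hypothesis.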

Conclusion


Placing the gate at level 40 is a better option than placing it at level 30.